Summary

This project uses data from the 1974 Motor Trend US magazine (comprising fuel consumption and 10 aspects of automobile design and performance for 32 automobiles) to:

- quantify the effect of transmission type (automatic vs manual) on fuel consumption, and
- identify the best overall predictors of fuel consumption.


Environment

if (!require("pacman")) install.packages("pacman"); library(pacman)
pacman::p_load(dplyr, corrplot, ggplot2, ggthemr, plotly)


Data

load data

data("mtcars")


view first few rows

head(mtcars, 4)
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1


The data has 32 observations of the following variables:

- mpg: miles per US gallon
- cyl: number of cylinders
- disp: engine displacement (cubic inches)
- hp: gross horsepower
- drat: rear axle ratio
- wt: weight (1000 lbs)
- qsec: quarter-mile time (seconds)
- vs: engine shape (0 = V-shaped, 1 = straight)
- am: transmission (0 = automatic, 1 = manual)
- gear: number of forward gears
- carb: number of carburetors


check for missingness

sum(is.na(mtcars))
## [1] 0


transform numeric variables to factors

mydata <- mtcars %>% mutate_at(c("cyl", "vs", "am", "gear", "carb"), as.factor)
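As a quick sanity check (a minimal standalone sketch of the step above), we can confirm the conversion took effect:

```r
library(dplyr)
data("mtcars")

# Same transformation as above: discrete-valued columns become factors
mydata <- mtcars %>% mutate_at(c("cyl", "vs", "am", "gear", "carb"), as.factor)

# Every converted column should now report class "factor"
sapply(mydata[c("cyl", "vs", "am", "gear", "carb")], class)
```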


Exploratory Analysis


Fig 1: Correlation Plot

corrplot.mixed(cor(mtcars), order="hclust", tl.col="black")

Observations:

- mpg is strongly negatively correlated with wt, cyl, disp and hp: heavier, larger-engined cars travel fewer miles per gallon.
- Several predictors (cyl, disp, wt, hp) are also strongly correlated with one another, so multicollinearity is a concern for modeling.
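The same information can be pulled out numerically; for instance, the correlations of every variable with mpg, sorted (a small sketch using the numeric mtcars data):

```r
data("mtcars")

# Correlation of each variable with mpg, from most negative to most positive
sort(cor(mtcars)[, "mpg"])
```

Weight (wt) sits at the most negative end and rear axle ratio (drat) near the positive end, consistent with the plot.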


Fig 2: Box plot: Fuel by Transmission

ggplot(mydata, aes(x=am, y=mpg, fill = am)) + 
  geom_boxplot() + 
  labs(x = "Type", y = "Miles per Gallon") +
  scale_fill_discrete(labels = c("Automatic", "Manual"))

Observation: Manual cars in the sample have better fuel efficiency, i.e., they travel farther on comparable fuel than automatic cars.


how much farther?

auto <- mydata[mydata$am == 0, ]
man <- mydata[mydata$am == 1, ]
mean(man$mpg) - mean(auto$mpg)
## [1] 7.244939


is this difference unique to our sample?

t.test(man$mpg, auto$mpg, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  man$mpg and auto$mpg
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.209684 11.280194
## sample estimates:
## mean of x mean of y 
##  24.39231  17.14737
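The confidence interval reported above can be reproduced by hand from the Welch formulas (a sketch; the variable names mirror the automatic/manual split above):

```r
data("mtcars")
man  <- mtcars[mtcars$am == 1, ]
auto <- mtcars[mtcars$am == 0, ]

# Standard error of the difference in means
se <- sqrt(var(man$mpg) / nrow(man) + var(auto$mpg) / nrow(auto))

# Welch-Satterthwaite degrees of freedom
df <- se^4 / ((var(man$mpg) / nrow(man))^2 / (nrow(man) - 1) +
              (var(auto$mpg) / nrow(auto))^2 / (nrow(auto) - 1))

# 95% CI for the difference in means; matches t.test's conf.int
diff <- mean(man$mpg) - mean(auto$mpg)
diff + c(-1, 1) * qt(0.975, df) * se
```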

Observations:

- With a p-value of 0.0014, we reject the null hypothesis of equal means at the 5% level.
- The 95% confidence interval (3.21 to 11.28 mpg) excludes zero, so the difference is unlikely to be a quirk of this particular sample.


Fig 3: Box plot: Fuel by Transmission|Number of Cylinders

ggthemr("dust")

bp <- ggplot(mydata, aes(x=am, y=mpg, fill = am)) + 
    geom_boxplot() +
    facet_wrap(. ~ cyl, labeller = "label_both")

ggplotly(bp)

Observation: As the number of cylinders increases, the difference in fuel consumption between automatic and manual transmission cars shrinks.
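The shrinking gap can be quantified directly (a sketch computing the manual-minus-automatic mean mpg within each cylinder group):

```r
data("mtcars")

# Mean mpg by transmission (columns) within each cylinder count (rows)
gap <- with(mtcars, tapply(mpg, list(cyl, am), mean))
gap

# Manual-minus-automatic gap per cylinder group
gap[, "1"] - gap[, "0"]
```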


Modeling

The goal is to gauge the effect of transmission type on fuel consumption after accounting for other predictors, and to investigate the best overall predictors of fuel consumption.


The model with only \(am\) as a predictor is the baseline.

baseline <- lm(mpg ~ am, data = mydata)
summary(baseline)
## 
## Call:
## lm(formula = mpg ~ am, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Observations:

- The coefficient on am1 (7.245) reproduces the raw difference in means computed earlier.
- Transmission type alone explains only about 36% of the variance in mpg (adjusted R-squared = 0.34), so other predictors are clearly needed.


Variable Selection

We use stepwise regression as a preliminary way of identifying important variables.

full <- lm(mpg ~ ., data = mydata)
select <- MASS::stepAIC(full, direction = "both", trace = FALSE)

The stepwise method identifies \(cyl\), \(hp\), \(wt\) and \(am\) as potential predictors.
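The selected formula can be inspected directly (a standalone sketch of the step above; MASS ships with base R):

```r
library(dplyr)
data("mtcars")
mydata <- mtcars %>% mutate_at(c("cyl", "vs", "am", "gear", "carb"), as.factor)

full   <- lm(mpg ~ ., data = mydata)
select <- MASS::stepAIC(full, direction = "both", trace = FALSE)

formula(select)   # the retained predictors, as reported above
```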


Model Building

summary(select)$adj.r.squared
## [1] 0.8400875

A model with these variables is much more informative (adjusted R-squared rises from 0.34 to 0.84, roughly 50 percentage points), but we still need to decide which combination of these predictors yields the best model.


To do so, we fit a (partial) second-order model and eliminate insignificant variables. This yields:

chosen <- lm(mpg ~ hp + wt + I(hp^2) + I(wt^2), data = mydata)
summary(chosen)
## 
## Call:
## lm(formula = mpg ~ hp + wt + I(hp^2) + I(wt^2), data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8849 -1.8165 -0.3922  1.3499  4.5807 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.945e+01  3.521e+00  14.044 6.27e-14 ***
## hp          -9.428e-02  3.193e-02  -2.952 0.006456 ** 
## wt          -9.220e+00  2.270e+00  -4.062 0.000375 ***
## I(hp^2)      1.743e-04  8.073e-05   2.159 0.039879 *  
## I(wt^2)      8.500e-01  3.005e-01   2.829 0.008700 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.135 on 27 degrees of freedom
## Multiple R-squared:  0.8907, Adjusted R-squared:  0.8745 
## F-statistic: 55.02 on 4 and 27 DF,  p-value: 1.363e-12

This model has fairly high explanatory power (87.5%) with few, all-significant predictors.
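One way to compare the stepwise and chosen models, which are not nested, is AIC, where lower is better (a sketch refitting both models from scratch):

```r
library(dplyr)
data("mtcars")
mydata <- mtcars %>% mutate_at(c("cyl", "vs", "am", "gear", "carb"), as.factor)

full   <- lm(mpg ~ ., data = mydata)
select <- MASS::stepAIC(full, direction = "both", trace = FALSE)
chosen <- lm(mpg ~ hp + wt + I(hp^2) + I(wt^2), data = mydata)

AIC(select, chosen)   # data frame of df and AIC for each model
```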


NB: Adding transmission type does not improve the model’s adjusted R-squared; in fact, the model is penalized for carrying an extraneous variable.

summary(lm(mpg ~ am + hp + wt + I(hp^2) + I(wt^2), data = mydata))$adj.r.squared
## [1] 0.8699158

This implies that knowing a car’s weight and horsepower yields the most information about its fuel consumption.
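As an illustration, the chosen model can predict mpg for a hypothetical car from just these two measurements (the hp and wt values below are made up for the example):

```r
data("mtcars")
chosen <- lm(mpg ~ hp + wt + I(hp^2) + I(wt^2), data = mtcars)

# Hypothetical car: 110 horsepower, 2620 lbs (wt is in 1000-lb units)
predict(chosen, newdata = data.frame(hp = 110, wt = 2.62))
```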


Diagnostics

Before proceeding, we check that the assumptions for linear regression aren’t grossly violated.

par(mfrow=c(2,2)) 
plot(chosen)

par(mfrow=c(1,1))


Zero Mean and Equal Variance of Errors

The Residuals vs Fitted plot shows residuals with a roughly zero mean and more or less equal spread, suggesting an unbiased model.


Normality of Errors

The points in the Normal Q-Q plot (top right) show some deviation from normality (they do not all lie on the diagonal line). We perform a secondary check for normality:

shapiro.test(summary(chosen)$resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  summary(chosen)$resid
## W = 0.94267, p-value = 0.08918

The null hypothesis of this test is that the errors are normally distributed. With a p-value of 0.09, we fail to reject the null i.e. we find no evidence that normality is violated.


Influential Outliers

The Residuals vs Leverage plot indicates possible leverage points. To gauge influence, we look at the observations with Cook’s distance > 4/n:

which(cooks.distance(chosen) > 4/nrow(mydata))
## 16 17 18 20 
## 16 17 18 20

However, these values do not appear to have been recorded in error, and deleting them does not materially improve the model, so we retain the chosen model.
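The effect of deletion can be checked by refitting without the flagged rows (a sketch; the row indices come from the Cook’s distance screen above):

```r
data("mtcars")
chosen  <- lm(mpg ~ hp + wt + I(hp^2) + I(wt^2), data = mtcars)

# Rows exceeding the 4/n Cook's distance threshold
flagged <- which(cooks.distance(chosen) > 4 / nrow(mtcars))

# Refit on the remaining rows and compare adjusted R-squared
refit <- lm(mpg ~ hp + wt + I(hp^2) + I(wt^2), data = mtcars[-flagged, ])
c(full = summary(chosen)$adj.r.squared, trimmed = summary(refit)$adj.r.squared)
```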


Conclusion

chosen
## 
## Call:
## lm(formula = mpg ~ hp + wt + I(hp^2) + I(wt^2), data = mydata)
## 
## Coefficients:
## (Intercept)           hp           wt      I(hp^2)      I(wt^2)  
##  49.4530494   -0.0942795   -9.2203220    0.0001743    0.8500054


Final Observations:

- In isolation, manual-transmission cars in this sample average about 7.2 mpg more than automatics, and the difference is statistically significant.
- Once weight and horsepower are accounted for, transmission type adds no explanatory power.
- The best compact model predicts mpg from hp and wt with quadratic terms, explaining about 87.5% of the variance (adjusted R-squared).
